Executive Summary

Predicting Expected Monthly Income and Expected Attrition is critical to understanding how to manage talent within any organization.

“According to the U.S. Bureau of Statistics, the average turnover rate in the U.S. is about 12% to 15% annually. According to LinkedIn, an average annual worldwide employee turnover rate is 10.9%.” source - https://www.talentlyft.com/en/blog/article/242/hr-metrics-how-and-why-to-calculate-employee-turnover-rate#:~:text=According%20to%20the%20U.S.%20Bureau,above%20the%20average%20turnover%20rates.

The turnover rate will of course vary from company to company but the data highlights a key insight that a significant portion of your employee population will need to be replaced each year. When it comes to Monthly Salary, the formula is much more complicated than paying everyone based on position title. Company’s must consider items like work experience, education, and a variety of other factors to determine fair compensation. Utilizing machine learning algorithms, we will use the relevant data provided to predict these two factors for Frito Lay.

Introduction

Frito Lay has provided a detailed data set including 36 variables on their employees. The requested objective from their team is to provide any general insights found and determine a model to predict Attrition and another model to predict Monthly Income. Our team at DDSAnalytics have broken down the data to find the key factors driving these metrics. We then leveraged these factors using machine learning algorithms like Naive Bayes and Linear Regression to predict each of the requested variables.

Loading and Cleaning Data

#Load required libraries
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(naniar)
library(ggplot2)
library(ggthemes)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ✓ readr   1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(corrplot)
## corrplot 0.84 loaded
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(class)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
#Read in data and remove irrelevant columns
ee=read.csv("~/Documents/SMU_DS/Doing Data Science/Case_Study_2_DDS/Raw_Data_Files/CaseStudy2-data.csv",header=TRUE,stringsAsFactors = TRUE,na.strings = c("","NA"))
str(ee)
## 'data.frame':    870 obs. of  36 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : int  2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ JobSatisfaction         : int  4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 1 3 3 3 3 1 3 4 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : int  2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...
ee=ee %>% select(-EmployeeCount,-StandardHours,-Over18,-EmployeeNumber)


#Where applicable, create alternative numerical version of the factor
summary(ee$PerformanceRating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   3.000   3.152   3.000   4.000
ee$fctrPerformanceRating=ifelse(ee$PerformanceRating<=3,'Low','High')
ee$fctrPerformanceRating=as.factor(ee$PerformanceRating)
ee$WorkLifeBalance_fctr
## NULL
ee=ee %>% mutate(DistanceFromHome_cat=ifelse(DistanceFromHome>19,"Very Far",                                    ifelse(DistanceFromHome>9,"Far",                                                                       ifelse(DistanceFromHome>3,"Close","Very Close"))))
ee$DistanceFromHome_cat=factor(ee$DistanceFromHome_cat)

ee=ee %>% mutate(numBusinessTravel=ifelse(BusinessTravel=="Non-Travel",0,
                                   ifelse(BusinessTravel=="Travel_Rarely",1,
                                          ifelse(BusinessTravel=="Travel_Frequently",2,3
                                          ))))

ee$numDepartment=ifelse(ee$Department=="Research & Development",0,1)

ee$numGender=ifelse(ee$Gender=="Female",0,1)

ee=ee %>% mutate(numMaritalStatus=ifelse(MaritalStatus=="Single",0,
                                   ifelse(MaritalStatus=="Married",1,
                                          ifelse(MaritalStatus=="Divorced",2,3
                                          ))))

ee$numOvertime=ifelse(ee$OverTime=="No",0,1)

ee$numAttrition=ifelse(ee$Attrition=="No",0,1)

#Look at scatterplot of all variables
test=ee %>% filter(Attrition=="Yes")
ee_num=ee %>% select_if(is.numeric)
ee_fctr=ee %>% select_if(is.factor)

ggplot(gather(ee_num), aes(value)) + 
    geom_histogram(bins = 15) +
    facet_wrap(~key, scales = 'free_x')

#Look at linear model of JobSatisfaction and Monthly Income 
ee %>% ggplot(aes(JobSatisfaction,MonthlyIncome,col=JobRole))+geom_smooth(method="lm")+ggtitle("Monthly Income vs. Job Satisfaction by Role")
## `geom_smooth()` using formula 'y ~ x'

#Look at specific values of Job Satifaction against Job Role
sat=ee %>% group_by(JobRole) %>% summarise(mean=mean(JobSatisfaction),median=median(JobSatisfaction))
## `summarise()` ungrouping output (override with `.groups` argument)
sat=data.frame(sat)
sat %>% arrange(desc(mean))
##                     JobRole     mean median
## 1 Healthcare Representative 2.828947      3
## 2        Research Scientist 2.802326      3
## 3           Sales Executive 2.725000      3
## 4    Manufacturing Director 2.724138      3
## 5      Sales Representative 2.698113      3
## 6     Laboratory Technician 2.686275      3
## 7           Human Resources 2.555556      3
## 8                   Manager 2.509804      2
## 9         Research Director 2.490196      3
sat
##                     JobRole     mean median
## 1 Healthcare Representative 2.828947      3
## 2           Human Resources 2.555556      3
## 3     Laboratory Technician 2.686275      3
## 4                   Manager 2.509804      2
## 5    Manufacturing Director 2.724138      3
## 6         Research Director 2.490196      3
## 7        Research Scientist 2.802326      3
## 8           Sales Executive 2.725000      3
## 9      Sales Representative 2.698113      3
#Visualize Job Satifaction against Job Role
ee %>% ggplot(aes(JobSatisfaction,reorder(JobRole,JobSatisfaction)))+geom_boxplot(fill='deepskyblue1',)+ggtitle("Job Satisfaction by Role")+theme(text = element_text(size=28))+stat_summary(fun.y=mean, geom="point", shape=23, size=5, color="red", fill="red")+xlab("Job Role")+ylab("Job Satisfaction")
## Warning: `fun.y` is deprecated. Use `fun` instead.

#Visualize Monthly Income against Job Role
ee %>% ggplot(aes(MonthlyIncome,reorder(JobRole,MonthlyIncome)))+geom_boxplot(fill='deepskyblue1',)+ggtitle("Monthly Income by Role")+theme(text = element_text(size=28))+stat_summary(fun.y=mean, geom="point", shape=23, size=3, color="red", fill="red")+xlab("Job Role")+ylab("Monthly Income")
## Warning: `fun.y` is deprecated. Use `fun` instead.

We have now created a numerical variable alternatives for each given variable where the variables can be converted based on category and ordinal placement. Also, we visualized each of the distributions for the variables. We can see many of the distributions are skewed. We also visualized Job Satisfaction by Job Role and Monthly income by Job Role. Each plot is organized by mean value.

#KNN Analysis on Attrition

#Test variables with Wicox Test for significance
vars_num=variable.names(ee_num[,-1])
p_matrix= matrix(nrow=length(ee_num)-1, ncol=3)
for (i in 1:length(vars_num)){
a=wilcox.test(ee_num[,(i+1)]~ee$Attrition,alternative="two.sided")
p_matrix[i,1]=vars_num[i]
p_matrix[i,2]=a[[3]]
p_matrix[i,3]=ifelse(a[[3]]<=.0017,"keep","remove")
}

p_matrix
##       [,1]                       [,2]                    [,3]    
##  [1,] "Age"                      "2.05142624461662e-06"  "keep"  
##  [2,] "DailyRate"                "0.299457401638044"     "remove"
##  [3,] "DistanceFromHome"         "0.0272542867612612"    "remove"
##  [4,] "Education"                "0.132737345594014"     "remove"
##  [5,] "EnvironmentSatisfaction"  "0.0384660720848109"    "remove"
##  [6,] "HourlyRate"               "0.29009357195183"      "remove"
##  [7,] "JobInvolvement"           "7.73304810764929e-07"  "keep"  
##  [8,] "JobLevel"                 "2.49503721487344e-08"  "keep"  
##  [9,] "JobSatisfaction"          "0.00118261722782567"   "keep"  
## [10,] "MonthlyIncome"            "4.07351265527439e-09"  "keep"  
## [11,] "MonthlyRate"              "0.208563502820914"     "remove"
## [12,] "NumCompaniesWorked"       "0.17226278333996"      "remove"
## [13,] "PercentSalaryHike"        "0.975981196105236"     "remove"
## [14,] "PerformanceRating"        "0.651465117867211"     "remove"
## [15,] "RelationshipSatisfaction" "0.29777866495085"      "remove"
## [16,] "StockOptionLevel"         "3.0761263616543e-09"   "keep"  
## [17,] "TotalWorkingYears"        "4.04234499349295e-09"  "keep"  
## [18,] "TrainingTimesLastYear"    "0.0922434095280687"    "remove"
## [19,] "WorkLifeBalance"          "0.034430719605193"     "remove"
## [20,] "YearsAtCompany"           "3.10980498624275e-08"  "keep"  
## [21,] "YearsInCurrentRole"       "9.48224070619458e-08"  "keep"  
## [22,] "YearsSinceLastPromotion"  "0.368074119349138"     "remove"
## [23,] "YearsWithCurrManager"     "9.34684457405231e-07"  "keep"  
## [24,] "numBusinessTravel"        "0.0168665223239186"    "remove"
## [25,] "numDepartment"            "0.00291697843641116"   "remove"
## [26,] "numGender"                "0.456795406955404"     "remove"
## [27,] "numMaritalStatus"         "5.35518941999522e-09"  "keep"  
## [28,] "numOvertime"              "1.06539927859169e-15"  "keep"  
## [29,] "numAttrition"             "5.42845762548052e-191" "keep"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")

p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep %>% filter(var!='numAttrition')
##                     var      p_value  sig
## 1           numOvertime 1.065399e-15 keep
## 2      StockOptionLevel 3.076126e-09 keep
## 3     TotalWorkingYears 4.042345e-09 keep
## 4         MonthlyIncome 4.073513e-09 keep
## 5      numMaritalStatus 5.355189e-09 keep
## 6              JobLevel 2.495037e-08 keep
## 7        YearsAtCompany 3.109805e-08 keep
## 8    YearsInCurrentRole 9.482241e-08 keep
## 9        JobInvolvement 7.733048e-07 keep
## 10 YearsWithCurrManager 9.346845e-07 keep
## 11                  Age 2.051426e-06 keep
## 12      JobSatisfaction 1.182617e-03 keep
p_keep_vars=p_keep[,1]
scaled_vars=ee_num %>% select(all_of(p_keep_vars))

#Create Over-Sample to ballance Attrition
holder=scaled_vars
for(f in 1:10){
set.seed(f)
att_rows=holder[holder$Attrition==1,]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
scaled_vars=rbind(scaled_vars,add_df)
}
str(scaled_vars)
## 'data.frame':    870 obs. of  13 variables:
##  $ numAttrition        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ numOvertime         : num  0 0 0 0 1 0 1 1 1 0 ...
##  $ StockOptionLevel    : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears   : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ MonthlyIncome       : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ numMaritalStatus    : num  2 0 0 1 0 2 1 2 1 1 ...
##  $ JobLevel            : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ YearsAtCompany      : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole  : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ JobInvolvement      : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ YearsWithCurrManager: int  3 9 2 7 3 7 3 0 0 7 ...
##  $ Age                 : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ JobSatisfaction     : int  4 3 4 4 4 1 3 4 3 3 ...
str(ee_num)
## 'data.frame':    870 obs. of  30 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EnvironmentSatisfaction : int  2 3 3 3 1 4 2 4 3 4 ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ JobSatisfaction         : int  4 3 4 4 4 1 3 4 3 3 ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 1 3 3 3 3 1 3 4 2 ...
##  $ StockOptionLevel        : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : int  2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...
##  $ numBusinessTravel       : num  1 1 2 1 2 2 1 1 1 2 ...
##  $ numDepartment           : num  1 0 0 1 0 0 0 1 1 0 ...
##  $ numGender               : num  1 1 1 0 0 1 1 0 0 1 ...
##  $ numMaritalStatus        : num  2 0 0 1 0 2 1 2 1 1 ...
##  $ numOvertime             : num  0 0 0 0 1 0 1 1 1 0 ...
##  $ numAttrition            : num  0 0 0 0 0 0 0 0 0 0 ...
#Scale Variables and Test KNN across K values
outcome_data= scaled_vars$numAttrition
scaled_vars=scaled_vars %>% select(-numAttrition)
scaled_vars=data.frame(scale(scaled_vars))
train_cols= scaled_vars
num_rand_samples= 15
max_k= 30
accuracy_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
sensitivity_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
specificity_matrix= matrix(nrow= num_rand_samples, ncol=max_k)
for(i in 1:num_rand_samples)
{
  set.seed(i)
  
  for(j in 1:max_k)
  {
    class_data= knn.cv(train_cols,cl = outcome_data, k=j)
    CM= confusionMatrix(table(class_data,outcome_data))
    accuracy_matrix[i,j]= CM$overall[1]
    sensitivity_matrix[i,j]= CM$byClass[1]
    specificity_matrix[i,j]= CM$byClass[2]
    
  }
}
MeanAcc = colMeans(accuracy_matrix)
MeanSens = colMeans(sensitivity_matrix)
MeanSpec= colMeans(specificity_matrix)
which(MeanAcc==max(MeanAcc))
## [1] 5 9
which(MeanSens==max(MeanSens))
## [1] 30
which(MeanSpec==max(MeanSpec))
## [1] 2
max(MeanAcc)
## [1] 0.8655172
max(MeanSens)
## [1] 0.9866667
max(MeanSpec)
## [1] 0.4004762
par(mfrow=c(1,1))

#Viualize results
plot(x = seq(1,max_k,1),y = MeanAcc, ylim=c(0,1),type = "l", main= "KNN Classification of Attrition based on Selected Variables", xlab= "k Value", ylab= "Mean Acc (black) / Mean Spec (blue) / Mean Sens (red)",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
lines(x = seq(1,max_k,1),y = MeanSens, ylim=c(0,1),type = "l", col="red")
lines(x = seq(1,max_k,1),y = MeanSpec, ylim=c(0,1),type = "l", col="blue")

summary(ee$Attrition)
##  No Yes 
## 730 140

To counteract the sample imbalance in Attrition in all upcoming models, we will use an over-sample technique and randomly draw from the ‘Yes’ Attrition values until the sample is balanced. KNN utilized numerical values only and requires them to be scaled. The variables were selected with a Wilcox Rank Sum Test to predict Attrition and the p-value criteria of 0.0017. The result as plotted shows the highest sensitivity KNN can achieve is approximately 0.4 which is below the requirement. We need a different algorithm.

#Naive Bayes Analysis on Attrition

#Deterimine Variables using Chi-Squared Test
ee_nb_options=ee[,-c(1,3,34:40)]
vars2=variable.names(ee[,-c(1,3,34:40)])
p_matrix= matrix(nrow=length(vars2), ncol=3)
for (i in 1:length(vars2)){
a=chisq.test(ee$Attrition,ee_nb_options[,(i)])
p_matrix[i,1]=vars2[i]
p_matrix[i,2]=a$p.value
p_matrix[i,3]=ifelse(a$p.value<=.0017,"keep","remove")
}
## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_nb_options[, (i)]): Chi-squared
## approximation may be incorrect
p_matrix
##       [,1]                       [,2]                   [,3]    
##  [1,] "Age"                      "0.000455925966956391" "keep"  
##  [2,] "BusinessTravel"           "0.0499253583464617"   "remove"
##  [3,] "DailyRate"                "0.394359650210051"    "remove"
##  [4,] "Department"               "0.00942377313351953"  "remove"
##  [5,] "DistanceFromHome"         "0.041015227721121"    "remove"
##  [6,] "Education"                "0.624283776640735"    "remove"
##  [7,] "EducationField"           "0.268219767928874"    "remove"
##  [8,] "EnvironmentSatisfaction"  "0.0105408976789624"   "remove"
##  [9,] "Gender"                   "0.515129674660433"    "remove"
## [10,] "HourlyRate"               "0.165336227368764"    "remove"
## [11,] "JobInvolvement"           "5.21104090021876e-09" "keep"  
## [12,] "JobLevel"                 "2.08470324574323e-08" "keep"  
## [13,] "JobRole"                  "3.64683624511309e-10" "keep"  
## [14,] "JobSatisfaction"          "0.0111512175494692"   "remove"
## [15,] "MaritalStatus"            "3.37894621259824e-08" "keep"  
## [16,] "MonthlyIncome"            "0.548532282023289"    "remove"
## [17,] "MonthlyRate"              "0.536630277147251"    "remove"
## [18,] "NumCompaniesWorked"       "0.0167775870740727"   "remove"
## [19,] "OverTime"                 "2.33298071371539e-15" "keep"  
## [20,] "PercentSalaryHike"        "0.277627401358292"    "remove"
## [21,] "PerformanceRating"        "0.746170559364242"    "remove"
## [22,] "RelationshipSatisfaction" "0.372711655350202"    "remove"
## [23,] "StockOptionLevel"         "3.7244644710761e-12"  "keep"  
## [24,] "TotalWorkingYears"        "0.000407183992741349" "keep"  
## [25,] "TrainingTimesLastYear"    "0.119204353121313"    "remove"
## [26,] "WorkLifeBalance"          "0.00249508979751358"  "remove"
## [27,] "YearsAtCompany"           "7.89367480428311e-05" "keep"  
## [28,] "YearsInCurrentRole"       "0.000792886613459846" "keep"  
## [29,] "YearsSinceLastPromotion"  "0.129359276246726"    "remove"
## [30,] "YearsWithCurrManager"     "0.000258107721542933" "keep"  
## [31,] "fctrPerformanceRating"    "0.746170559364242"    "remove"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")
p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep
##                     var      p_value  sig
## 1              OverTime 2.332981e-15 keep
## 2      StockOptionLevel 3.724464e-12 keep
## 3               JobRole 3.646836e-10 keep
## 4        JobInvolvement 5.211041e-09 keep
## 5              JobLevel 2.084703e-08 keep
## 6         MaritalStatus 3.378946e-08 keep
## 7        YearsAtCompany 7.893675e-05 keep
## 8  YearsWithCurrManager 2.581077e-04 keep
## 9     TotalWorkingYears 4.071840e-04 keep
## 10                  Age 4.559260e-04 keep
## 11   YearsInCurrentRole 7.928866e-04 keep
keep_len=dim(p_keep)[1]
MeanAcc_nb=matrix(nrow= keep_len)
MeanSens_nb=matrix(nrow= keep_len)
MeanSpec_nb=matrix(nrow= keep_len)

#Big loop to loop through adding a parameter by significance
for(n in 1:keep_len){
p_keep_select=head(p_keep,n)
p_keep_vars=p_keep_select[,1]

ee_nb=ee %>% select(all_of(p_keep_vars))
ee_nb_num=ee_nb%>% select_if(is.numeric)
ee_nb_fctr=ee_nb%>% select_if(is.factor)
ee_nb_num=scale(ee_nb_num)
ee_nb_w_response=cbind(ee_nb_num,ee_nb_fctr,Attrition=ee$Attrition)

ee_nb_final=ee_nb_w_response

#Loop to create Over-Sample to balance attrition
for(f in 1:10){
set.seed(f)
att_rows=ee_nb_final[ee_nb_final$Attrition=="Yes",]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
ee_nb_w_response=rbind(ee_nb_w_response,add_df)
}

#Loop to test Naive Bayes over multiple iterations
iterations= 20
masterAcc=matrix(nrow= iterations)
masterSens=matrix(nrow=iterations)
masterSpec=matrix(nrow=iterations)
splitPerc= .7 #training/test split percentage
 for(j in 1:iterations)
  {
    set.seed(j)
    trainIndices= sample(1:dim(ee_nb_w_response)[1],round(splitPerc*dim(ee_nb_w_response)[1]))
    train= ee_nb_w_response[trainIndices,]
    test=ee_nb_w_response[-trainIndices,]
    train_columns_split_model= train[,-(n+1)]
    test_columns= test[,-(n+1)]
    
    model= naiveBayes(train_columns_split_model,train$Attrition)
    CM= confusionMatrix(table(predict(model,test_columns), test$Attrition))
    masterAcc[j]=CM$overall[1]
    masterSens[j]=CM$byClass[1]
    masterSpec[j]=CM$byClass[2]
  }
MeanAcc_nb[n]=colMeans(masterAcc)
MeanSens_nb[n]=colMeans(masterSens)
MeanSpec_nb[n]= colMeans(masterSpec)
}
MeanAcc_nb
##            [,1]
##  [1,] 0.4872146
##  [2,] 0.6675799
##  [3,] 0.6985160
##  [4,] 0.7281963
##  [5,] 0.7143836
##  [6,] 0.7232877
##  [7,] 0.7281963
##  [8,] 0.7237443
##  [9,] 0.7214612
## [10,] 0.7147260
## [11,] 0.7012557
MeanSens_nb
##            [,1]
##  [1,] 0.6500000
##  [2,] 0.7678776
##  [3,] 0.7676154
##  [4,] 0.7296704
##  [5,] 0.7127203
##  [6,] 0.6960368
##  [7,] 0.7050290
##  [8,] 0.7064496
##  [9,] 0.6880368
## [10,] 0.6827533
## [11,] 0.6662602
MeanSpec_nb
##            [,1]
##  [1,] 0.3500000
##  [2,] 0.5682520
##  [3,] 0.6307902
##  [4,] 0.7274035
##  [5,] 0.7165578
##  [6,] 0.7506261
##  [7,] 0.7516524
##  [8,] 0.7413414
##  [9,] 0.7548895
## [10,] 0.7466924
## [11,] 0.7360456
m=MeanSens_nb+MeanSpec_nb
m
##           [,1]
##  [1,] 1.000000
##  [2,] 1.336130
##  [3,] 1.398406
##  [4,] 1.457074
##  [5,] 1.429278
##  [6,] 1.446663
##  [7,] 1.456681
##  [8,] 1.447791
##  [9,] 1.442926
## [10,] 1.429446
## [11,] 1.402306
#Visualize Results
plot(x = seq(1,length(MeanAcc_nb),1),y = MeanAcc_nb, ylim=c(0,1),type = "l", main= "NB Classification of Attrition based on Selected Variables", xlab= "Parameters", ylab= "Mean Acc (black) / Mean Spec (blue) / Mean Sens (red)",cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
lines(x = seq(1,length(MeanSens_nb),1),y = MeanSens_nb, ylim=c(0,1),type = "l", col="red")
lines(x = seq(1,length(MeanSpec_nb),1),y = MeanSpec_nb, ylim=c(0,1),type = "l", col="blue")

dim(ee_nb_final)
## [1] 870  12
str(ee_nb_w_response)
## 'data.frame':    1460 obs. of  12 variables:
##  $ StockOptionLevel    : num  0.252 -0.914 -0.914 1.418 -0.914 ...
##  $ JobInvolvement      : num  0.394 -1.028 0.394 0.394 0.394 ...
##  $ JobLevel            : num  -0.0358 2.7162 0.8815 0.8815 -0.9532 ...
##  $ YearsAtCompany      : num  -0.326 2.165 -0.824 1.169 -0.16 ...
##  $ YearsWithCurrManager: num  -0.319 1.36 -0.599 0.8 -0.319 ...
##  $ TotalWorkingYears   : num  -0.406 1.324 -0.14 0.392 -0.672 ...
##  $ Age                 : num  -0.541 0.355 -0.205 -0.541 -1.437 ...
##  $ YearsInCurrentRole  : num  -0.606 0.768 -0.606 1.592 -0.331 ...
##  $ OverTime            : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ JobRole             : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ MaritalStatus       : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ Attrition           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#Test final model on train data
model=naiveBayes(ee_nb_w_response[,-c(5:12)],ee_nb_w_response$Attrition)
CM= confusionMatrix(table(predict(model,ee_nb_w_response[,-12]),
                          ee_nb_w_response$Attrition))   
CM
## Confusion Matrix and Statistics
## 
##      
##        No Yes
##   No  502 252
##   Yes 228 478
##                                           
##                Accuracy : 0.6712          
##                  95% CI : (0.6465, 0.6953)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.3425          
##                                           
##  Mcnemar's Test P-Value : 0.2938          
##                                           
##             Sensitivity : 0.6877          
##             Specificity : 0.6548          
##          Pos Pred Value : 0.6658          
##          Neg Pred Value : 0.6771          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3438          
##    Detection Prevalence : 0.5164          
##       Balanced Accuracy : 0.6712          
##                                           
##        'Positive' Class : No              
## 
#Create csv of predictions on test data and store predictions on GitHub
att_test=read.csv("~/Documents/SMU_DS/Doing Data Science/Case_Study_2_DDS/Raw_Data_Files/CaseStudy2CompSet No Attrition.csv",header = TRUE,stringsAsFactors = TRUE)
ref=ee_nb_w_response %>% select(-Attrition)
att_vars=att_test %>% select(all_of(variable.names(ref)))
variable.names(ee_nb_w_response)
##  [1] "StockOptionLevel"     "JobInvolvement"       "JobLevel"            
##  [4] "YearsAtCompany"       "YearsWithCurrManager" "TotalWorkingYears"   
##  [7] "Age"                  "YearsInCurrentRole"   "OverTime"            
## [10] "JobRole"              "MaritalStatus"        "Attrition"
model=naiveBayes(ee_nb_w_response[,-c(5:12)],ee_nb_w_response$Attrition)
att_preds=predict(model,att_vars) 
final_preds=data.frame(ID=att_test$ID,Attrition=att_preds)
summary(final_preds)
##        ID       Attrition
##  Min.   :1171   No :220  
##  1st Qu.:1246   Yes: 80  
##  Median :1320            
##  Mean   :1320            
##  3rd Qu.:1395            
##  Max.   :1470
summary(ee$Attrition)
##  No Yes 
## 730 140
summary(ee_nb_w_response$Attrition)
##  No Yes 
## 730 730
#write.csv(final_preds,"Case2PredictionsStewartAttrition.csv")

Naive Bayes allows categorical and continuous variables to be used. The variables were selected using a Chi-Squared Test with a p-value criteria of 0.0017. The plot tested adding one parameter at a time in the order of significance. The ideal result is using the top four parameters (Overtime, StockOptionLevel, JobRole, and JobInvolvement). They achieved a sensitivity of 0.6877 and a specificity of 0.6548.

Random Forest Analysis on Attrition

#Select Variables
ee_rf_options=ee[,-c(1,3,34:40)]
vars2=variable.names(ee[,-c(1,3,34:40)])
p_matrix= matrix(nrow=length(vars2), ncol=3)
for (i in 1:length(vars2)){
a=chisq.test(ee$Attrition,ee_rf_options[,(i)])
p_matrix[i,1]=vars2[i]
p_matrix[i,2]=a$p.value
p_matrix[i,3]=ifelse(a$p.value<=.3,"keep","remove")
}
## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect

## Warning in chisq.test(ee$Attrition, ee_rf_options[, (i)]): Chi-squared
## approximation may be incorrect
p_matrix
##       [,1]                       [,2]                   [,3]    
##  [1,] "Age"                      "0.000455925966956391" "keep"  
##  [2,] "BusinessTravel"           "0.0499253583464617"   "keep"  
##  [3,] "DailyRate"                "0.394359650210051"    "remove"
##  [4,] "Department"               "0.00942377313351953"  "keep"  
##  [5,] "DistanceFromHome"         "0.041015227721121"    "keep"  
##  [6,] "Education"                "0.624283776640735"    "remove"
##  [7,] "EducationField"           "0.268219767928874"    "keep"  
##  [8,] "EnvironmentSatisfaction"  "0.0105408976789624"   "keep"  
##  [9,] "Gender"                   "0.515129674660433"    "remove"
## [10,] "HourlyRate"               "0.165336227368764"    "keep"  
## [11,] "JobInvolvement"           "5.21104090021876e-09" "keep"  
## [12,] "JobLevel"                 "2.08470324574323e-08" "keep"  
## [13,] "JobRole"                  "3.64683624511309e-10" "keep"  
## [14,] "JobSatisfaction"          "0.0111512175494692"   "keep"  
## [15,] "MaritalStatus"            "3.37894621259824e-08" "keep"  
## [16,] "MonthlyIncome"            "0.548532282023289"    "remove"
## [17,] "MonthlyRate"              "0.536630277147251"    "remove"
## [18,] "NumCompaniesWorked"       "0.0167775870740727"   "keep"  
## [19,] "OverTime"                 "2.33298071371539e-15" "keep"  
## [20,] "PercentSalaryHike"        "0.277627401358292"    "keep"  
## [21,] "PerformanceRating"        "0.746170559364242"    "remove"
## [22,] "RelationshipSatisfaction" "0.372711655350202"    "remove"
## [23,] "StockOptionLevel"         "3.7244644710761e-12"  "keep"  
## [24,] "TotalWorkingYears"        "0.000407183992741349" "keep"  
## [25,] "TrainingTimesLastYear"    "0.119204353121313"    "keep"  
## [26,] "WorkLifeBalance"          "0.00249508979751358"  "keep"  
## [27,] "YearsAtCompany"           "7.89367480428311e-05" "keep"  
## [28,] "YearsInCurrentRole"       "0.000792886613459846" "keep"  
## [29,] "YearsSinceLastPromotion"  "0.129359276246726"    "keep"  
## [30,] "YearsWithCurrManager"     "0.000258107721542933" "keep"  
## [31,] "fctrPerformanceRating"    "0.746170559364242"    "remove"
p_vals=data.frame(p_matrix)
colnames(p_vals)=c("var","p_value","sig")
p_keep=p_vals %>% filter(sig=="keep")
p_keep$p_value=as.numeric(p_keep$p_value)
p_keep=p_keep %>% arrange(p_value)
p_keep[c(3,5,6,7,8),]
##                    var      p_value  sig
## 3              JobRole 3.646836e-10 keep
## 5             JobLevel 2.084703e-08 keep
## 6        MaritalStatus 3.378946e-08 keep
## 7       YearsAtCompany 7.893675e-05 keep
## 8 YearsWithCurrManager 2.581077e-04 keep
keep_len=dim(p_keep)[1]

#p_keep_select=head(p_keep,5)
#p_keep_vars=p_keep_select[,1]
p_keep_vars=p_keep[,1]

#Test RF
ee_rf=ee %>% select(all_of(p_keep_vars))
ee_rf_num=ee_rf%>% select_if(is.numeric)
ee_rf_fctr=ee_rf%>% select_if(is.factor)
ee_rf_num=scale(ee_rf_num)
ee_rf_w_response=cbind(ee_rf_num,ee_rf_fctr,Attrition=ee$Attrition)
iterations=5
masterAcc=matrix(nrow= iterations)
masterSens=matrix(nrow=iterations)
masterSpec=matrix(nrow=iterations)
a=c()
i=5

holder=ee_rf_w_response
for(f in 1:10){
set.seed(f)
att_rows=holder[holder$Attrition==1,]
extrarows= sample(1:dim(att_rows)[1],round(.42143*dim(att_rows)[1]))
add_df=att_rows[extrarows,]
ee_rf_w_response=rbind(ee_rf_w_response,add_df)
}

for (i in 1:iterations) {
    set.seed(i)
    trainIndices= sample(1:dim(ee_rf_w_response)[1],round(splitPerc*dim(ee_rf_w_response)[1]))
    train= ee_rf_w_response[trainIndices,]
    test=ee_rf_w_response[-trainIndices,]
    train_columns_split_model= train[,-dim(ee_rf_w_response)[2]]
    test_columns= test[,-dim(ee_rf_w_response)[2]]
  model <- randomForest(Attrition ~ ., data = train, ntree = 1000, mtry = i, importance = TRUE)
  predValid <- predict(model, test_columns, type = "class")
  CM= confusionMatrix(table(predValid, test$Attrition))
  #a[i] = mean(predValid == test$Attrition)
  masterAcc[i]=CM$overall[1]
  masterSens[i]=CM$byClass[1]
  masterSpec[i]=CM$byClass[2]
  }
MeanAcc_rf=colMeans(masterAcc)
MeanSens_rf=colMeans(masterSens)
MeanSpec_rf= colMeans(masterSpec)
MeanAcc_rf
## [1] 0.8582375
MeanSens_rf
## [1] 0.9946218
MeanSpec_rf
## [1] 0.09697308
#Visualize Results
plot(x = seq(1,iterations,1),y = masterAcc, ylim=c(0,1),type = "l", main= "NB Classification of Attrition", xlab= "k Value", ylab= "Mean Accuracy")
lines(x = seq(1,iterations,1),y = masterSens, ylim=c(0,1),type = "l", xlab= "k Value", col="red")
lines(x = seq(1,iterations,1),y = masterSpec, ylim=c(0,1),type = "l",  xlab= "k Value", col="blue")

 model <- randomForest(Attrition ~ ., data = train, ntree = 1500, mtry = 7, importance = TRUE)
 model$importance
##                                    No           Yes MeanDecreaseAccuracy
## StockOptionLevel         2.191667e-03  9.274061e-03         0.0032870995
## JobInvolvement           2.191694e-03  1.129934e-02         0.0036473073
## JobLevel                 3.873309e-03  1.147421e-02         0.0051202063
## YearsAtCompany           6.697009e-03  1.691399e-02         0.0082618003
## YearsWithCurrManager     4.366647e-03  1.089474e-02         0.0053665764
## TotalWorkingYears        5.835737e-03  7.500813e-03         0.0061235570
## Age                      3.533869e-03  1.314585e-02         0.0050255608
## YearsInCurrentRole       3.060616e-03  8.446561e-03         0.0039210255
## WorkLifeBalance          6.267695e-04  1.945619e-03         0.0008460357
## EnvironmentSatisfaction  1.047288e-03  4.763286e-03         0.0016642176
## JobSatisfaction          1.318489e-03  5.758538e-03         0.0020468737
## NumCompaniesWorked       1.269049e-03  1.149504e-03         0.0012447503
## DistanceFromHome         1.571207e-04  8.340183e-04         0.0002844521
## TrainingTimesLastYear    2.180285e-04 -4.220486e-05         0.0001858943
## YearsSinceLastPromotion  2.111187e-03 -7.843346e-04         0.0016057048
## HourlyRate              -8.809999e-05 -2.463588e-03        -0.0005020419
## PercentSalaryHike        2.224664e-04  2.054763e-03         0.0005175547
## OverTime                 8.779198e-03  4.181615e-02         0.0141316249
## JobRole                  6.550367e-03  1.024109e-02         0.0071215637
## MaritalStatus            2.350777e-03  1.020928e-02         0.0036464525
## Department               1.049103e-03  9.644016e-04         0.0010399215
## BusinessTravel          -1.289774e-04 -2.504549e-04        -0.0001374869
## EducationField           2.163567e-03  5.124491e-03         0.0026506974
##                         MeanDecreaseGini
## StockOptionLevel                6.740399
## JobInvolvement                  8.136381
## JobLevel                        5.183559
## YearsAtCompany                  9.755474
## YearsWithCurrManager            6.677067
## TotalWorkingYears               9.382286
## Age                            12.508573
## YearsInCurrentRole              6.069229
## WorkLifeBalance                 5.312414
## EnvironmentSatisfaction         6.000749
## JobSatisfaction                 6.317692
## NumCompaniesWorked              6.838441
## DistanceFromHome                8.782696
## TrainingTimesLastYear           5.341245
## YearsSinceLastPromotion         5.041925
## HourlyRate                      9.994680
## PercentSalaryHike               7.749202
## OverTime                       10.989015
## JobRole                        12.622837
## MaritalStatus                   5.020851
## Department                      1.961729
## BusinessTravel                  2.393429
## EducationField                  8.480301

The Random Forest model was an experiment on a model I had not used before. After experimenting with different mtry values and randomTree values, the highest specificity was less than 0.2 which is far below the 0.6 criteria. This model was not usable.

#Linear Regression to Predict Income

#Look at significance and correlations
res2 <- rcorr(as.matrix(ee_num[,-c(25:30)]))

str(res2$r)
##  num [1:24, 1:24] 1 -0.0401 -0.0329 0.0661 -0.0495 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:24] "ID" "Age" "DailyRate" "DistanceFromHome" ...
##   ..$ : chr [1:24] "ID" "Age" "DailyRate" "DistanceFromHome" ...
corrplot(res2$r, type="upper", order="hclust", 
         p.mat = res2$P, sig.level = 0.1, insig = "blank")

c1=data.frame(attributes (res2$P[,11]))
c2=data.frame(res2$P[,11])
c3=data.frame(abs(res2$r[,11]))
cor_p=cbind(c1,c2,c3)
colnames(cor_p)=c("vars","p_value","r_value")
cor_p=cor_p %>% arrange(desc(r_value))
cor_p$sig=ifelse(cor_p$p_value<=.01,'sig','not sig')
cor_p
##                                              vars      p_value      r_value
## MonthlyIncome                       MonthlyIncome           NA 1.000000e+00
## JobLevel                                 JobLevel 0.000000e+00 9.516400e-01
## TotalWorkingYears               TotalWorkingYears 0.000000e+00 7.785112e-01
## YearsAtCompany                     YearsAtCompany 0.000000e+00 4.913790e-01
## Age                                           Age 0.000000e+00 4.842883e-01
## YearsInCurrentRole             YearsInCurrentRole 0.000000e+00 3.618405e-01
## YearsWithCurrManager         YearsWithCurrManager 0.000000e+00 3.284875e-01
## YearsSinceLastPromotion   YearsSinceLastPromotion 0.000000e+00 3.159116e-01
## NumCompaniesWorked             NumCompaniesWorked 3.839539e-06 1.558943e-01
## Education                               Education 1.701095e-04 1.271314e-01
## MonthlyRate                           MonthlyRate 5.684494e-02 6.459407e-02
## PercentSalaryHike               PercentSalaryHike 1.123575e-01 5.386594e-02
## JobSatisfaction                   JobSatisfaction 1.172293e-01 5.314853e-02
## ID                                             ID 1.280196e-01 5.163903e-02
## PerformanceRating               PerformanceRating 2.037486e-01 4.313100e-02
## TrainingTimesLastYear       TrainingTimesLastYear 2.503303e-01 3.901457e-02
## WorkLifeBalance                   WorkLifeBalance 5.399821e-01 2.080489e-02
## StockOptionLevel                 StockOptionLevel 5.779471e-01 1.888872e-02
## EnvironmentSatisfaction   EnvironmentSatisfaction 6.190955e-01 1.687744e-02
## DistanceFromHome                 DistanceFromHome 8.443188e-01 6.667155e-03
## RelationshipSatisfaction RelationshipSatisfaction 9.089071e-01 3.884678e-03
## HourlyRate                             HourlyRate 9.438534e-01 2.391151e-03
## JobInvolvement                     JobInvolvement 9.894867e-01 4.473792e-04
## DailyRate                               DailyRate 9.979342e-01 8.790339e-05
##                              sig
## MonthlyIncome               <NA>
## JobLevel                     sig
## TotalWorkingYears            sig
## YearsAtCompany               sig
## Age                          sig
## YearsInCurrentRole           sig
## YearsWithCurrManager         sig
## YearsSinceLastPromotion      sig
## NumCompaniesWorked           sig
## Education                    sig
## MonthlyRate              not sig
## PercentSalaryHike        not sig
## JobSatisfaction          not sig
## ID                       not sig
## PerformanceRating        not sig
## TrainingTimesLastYear    not sig
## WorkLifeBalance          not sig
## StockOptionLevel         not sig
## EnvironmentSatisfaction  not sig
## DistanceFromHome         not sig
## RelationshipSatisfaction not sig
## HourlyRate               not sig
## JobInvolvement           not sig
## DailyRate                not sig
cor_p_select=cor_p %>% filter(sig=='sig'|vars=='MonthlyIncome' )
cor_p_select
##                                            vars      p_value   r_value  sig
## MonthlyIncome                     MonthlyIncome           NA 1.0000000 <NA>
## JobLevel                               JobLevel 0.000000e+00 0.9516400  sig
## TotalWorkingYears             TotalWorkingYears 0.000000e+00 0.7785112  sig
## YearsAtCompany                   YearsAtCompany 0.000000e+00 0.4913790  sig
## Age                                         Age 0.000000e+00 0.4842883  sig
## YearsInCurrentRole           YearsInCurrentRole 0.000000e+00 0.3618405  sig
## YearsWithCurrManager       YearsWithCurrManager 0.000000e+00 0.3284875  sig
## YearsSinceLastPromotion YearsSinceLastPromotion 0.000000e+00 0.3159116  sig
## NumCompaniesWorked           NumCompaniesWorked 3.839539e-06 0.1558943  sig
## Education                             Education 1.701095e-04 0.1271314  sig
cor_p_select=cor_p_select[,1]
eekeep=ee %>% select(all_of(cor_p_select))
eekeep_len=length(eekeep)-1
par(mfrow=c(4,3))
# for(i in 1:eekeep_len){
#   plot(eekeep$MonthlyIncome,eekeep[,(i+1)])
# }

#Looking at untransformed scatterplots
eegath=eekeep %>%
  as_data_frame() %>%
  gather(key = "variable", value = "value",
         -MonthlyIncome)
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
ggplot(eegath, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='loess')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.98
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1629e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.98
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2.02
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.1629e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4

ggplot(eegath, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='lm')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'

#Looking at log(MonthlyIncome) scatter
eekeep_logy=eekeep
eekeep_logy$MonthlyIncome=log(eekeep_logy$MonthlyIncome)

eegath2=eekeep_logy %>%
  as_data_frame() %>%
  gather(key = "variable", value = "value",
         -MonthlyIncome)

ggplot(eegath2, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='lm')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'

ggplot(eegath2, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='loess')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 3
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.98
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.02
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1629e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.98
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2.02
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.1629e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4

#Looking at log(MonthlyIncome) and log all variables scatter
eekeep_logall=eekeep
len=length(eekeep_logall)
for(i in 1:len){
  eekeep_logall[,i]=log(eekeep_logall[,i]+.01)
}
str(eekeep_logall)
## 'data.frame':    870 obs. of  10 variables:
##  $ MonthlyIncome          : num  8.39 9.88 9.14 9.25 8.23 ...
##  $ JobLevel               : num  0.69813 1.61144 1.10194 1.10194 0.00995 ...
##  $ TotalWorkingYears      : num  2.08 3.04 2.3 2.64 1.79 ...
##  $ YearsAtCompany         : num  1.611 2.996 0.698 2.64 1.793 ...
##  $ Age                    : num  3.47 3.69 3.56 3.47 3.18 ...
##  $ YearsInCurrentRole     : num  0.698 1.947 0.698 2.304 1.102 ...
##  $ YearsWithCurrManager   : num  1.102 2.198 0.698 1.947 1.102 ...
##  $ YearsSinceLastPromotion: num  -4.60517 1.38879 0.69813 1.61144 0.00995 ...
##  $ NumCompaniesWorked     : num  0.69813 0.00995 0.69813 0.00995 0.00995 ...
##  $ Education              : num  1.38879 1.10194 0.69813 1.38879 0.00995 ...
eegath3=eekeep_logall %>%
  as_data_frame() %>%
  gather(key = "variable", value = "value",
         -MonthlyIncome)

ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='lm')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'

#Looking at the MonthlyIncome and log of all x values
eekeep_logx=eekeep
len=length(eekeep_logx)-1
for(i in 1:len){
  eekeep_logx[,i+1]=log(eekeep_logx[,i+1]+.01)
}
str(eekeep_logx)
## 'data.frame':    870 obs. of  10 variables:
##  $ MonthlyIncome          : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ JobLevel               : num  0.69813 1.61144 1.10194 1.10194 0.00995 ...
##  $ TotalWorkingYears      : num  2.08 3.04 2.3 2.64 1.79 ...
##  $ YearsAtCompany         : num  1.611 2.996 0.698 2.64 1.793 ...
##  $ Age                    : num  3.47 3.69 3.56 3.47 3.18 ...
##  $ YearsInCurrentRole     : num  0.698 1.947 0.698 2.304 1.102 ...
##  $ YearsWithCurrManager   : num  1.102 2.198 0.698 1.947 1.102 ...
##  $ YearsSinceLastPromotion: num  -4.60517 1.38879 0.69813 1.61144 0.00995 ...
##  $ NumCompaniesWorked     : num  0.69813 0.00995 0.69813 0.00995 0.00995 ...
##  $ Education              : num  1.38879 1.10194 0.69813 1.38879 0.00995 ...
eegath3=eekeep_logx %>%
  as_data_frame() %>%
  gather(key = "variable", value = "value",
         -MonthlyIncome)

ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='lm')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'

ggplot(eegath3, aes(x = value, y = MonthlyIncome)) +
  geom_point(aes()) +
  geom_smooth(method='loess')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.1019
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.40381
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.7084e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 1.1019
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.40381
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 3.7084e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.0019429
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.8987e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.1924
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 0.0019429
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 1.1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 5.8987e-15
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1.1924

str(ee_fctr)
## 'data.frame':    870 obs. of  10 variables:
##  $ Attrition            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BusinessTravel       : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
##  $ Department           : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
##  $ EducationField       : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
##  $ Gender               : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ JobRole              : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ MaritalStatus        : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ OverTime             : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ fctrPerformanceRating: Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
##  $ DistanceFromHome_cat : Factor w/ 4 levels "Close","Far",..: 2 2 2 3 3 2 1 2 2 2 ...
#fit the model with selected variables
par(mfrow=c(2,2))
fit1=glm(MonthlyIncome~I(log(JobLevel)^2)+I(TotalWorkingYears^2)+I(YearsInCurrentRole^3)+I(Education^2)+log(JobLevel)+.-Age-YearsSinceLastPromotion-YearsInCurrentRole-NumCompaniesWorked-YearsWithCurrManager-Education-JobLevel,data = eekeep)
summary(fit1)
## 
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + I(TotalWorkingYears^2) + 
##     I(YearsInCurrentRole^3) + I(Education^2) + log(JobLevel) + 
##     . - Age - YearsSinceLastPromotion - YearsInCurrentRole - 
##     NumCompaniesWorked - YearsWithCurrManager - Education - JobLevel, 
##     data = eekeep)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5254.3   -686.4   -106.6    603.6   4716.2  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2.278e+03  1.309e+02  17.401  < 2e-16 ***
## I(log(JobLevel)^2)       6.888e+03  2.145e+02  32.107  < 2e-16 ***
## I(TotalWorkingYears^2)  -1.860e+00  6.364e-01  -2.923  0.00356 ** 
## I(YearsInCurrentRole^3)  1.613e-01  8.774e-02   1.838  0.06638 .  
## I(Education^2)           8.148e+00  7.457e+00   1.093  0.27490    
## log(JobLevel)           -1.243e+03  2.720e+02  -4.572 5.54e-06 ***
## TotalWorkingYears        1.041e+02  2.170e+01   4.796 1.90e-06 ***
## YearsAtCompany          -3.284e+01  1.061e+01  -3.096  0.00202 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1560633)
## 
##     Null deviance: 1.8370e+10  on 869  degrees of freedom
## Residual deviance: 1.3453e+09  on 862  degrees of freedom
## AIC: 14886
## 
## Number of Fisher Scoring iterations: 2
plot(fit1)

ee_nums_w_cat_options=cbind(MonthlyIncome=ee$MonthlyIncome,JobLevel=ee$JobLevel,TotalWorkingYears=ee$TotalWorkingYears,YearsInCurrentRole=ee$YearsInCurrentRole,Education=ee$Education,YearsAtCompany=ee$YearsAtCompany,ee_fctr)

fit2=glm(MonthlyIncome~I(log(JobLevel)^2)+I(TotalWorkingYears^2)+I(Education^2)+I(YearsInCurrentRole^3)+log(JobLevel)+JobRole*TotalWorkingYears+.-YearsInCurrentRole-Education-JobLevel
          -Attrition-BusinessTravel-Department-EducationField-Gender-MaritalStatus-OverTime-fctrPerformanceRating-DistanceFromHome_cat
          ,data = ee_nums_w_cat_options,)
summary(fit2)
## 
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + I(TotalWorkingYears^2) + 
##     I(Education^2) + I(YearsInCurrentRole^3) + log(JobLevel) + 
##     JobRole * TotalWorkingYears + . - YearsInCurrentRole - Education - 
##     JobLevel - Attrition - BusinessTravel - Department - EducationField - 
##     Gender - MaritalStatus - OverTime - fctrPerformanceRating - 
##     DistanceFromHome_cat, data = ee_nums_w_cat_options)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2918.1   -602.4    -88.5    548.6   4494.9  
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                     2358.73104  349.41640   6.750
## I(log(JobLevel)^2)                              4629.89531  242.76374  19.072
## I(TotalWorkingYears^2)                            -3.72633    0.70555  -5.281
## I(Education^2)                                    -2.39343    5.97168  -0.401
## I(YearsInCurrentRole^3)                            0.07745    0.07046   1.099
## log(JobLevel)                                   -503.10157  344.61383  -1.460
## JobRoleHuman Resources                            88.04394  544.48865   0.162
## JobRoleLaboratory Technician                     150.61356  360.92893   0.417
## JobRoleManager                                  2789.93845  657.29350   4.245
## JobRoleManufacturing Director                     27.26143  339.67603   0.080
## JobRoleResearch Director                        3389.46125  503.65723   6.730
## JobRoleResearch Scientist                        -26.64747  364.37632  -0.073
## JobRoleSales Executive                           230.02934  302.17781   0.761
## JobRoleSales Representative                     -148.27843  384.96350  -0.385
## TotalWorkingYears                                197.24329   31.56686   6.248
## YearsAtCompany                                    -7.93561    8.63699  -0.919
## JobRoleHuman Resources:TotalWorkingYears        -107.48209   65.72100  -1.635
## JobRoleLaboratory Technician:TotalWorkingYears  -125.94409   26.22448  -4.803
## JobRoleManager:TotalWorkingYears                  19.78787   28.41280   0.696
## JobRoleManufacturing Director:TotalWorkingYears   10.11299   23.24741   0.435
## JobRoleResearch Director:TotalWorkingYears        -0.22683   25.45611  -0.009
## JobRoleResearch Scientist:TotalWorkingYears      -76.69482   26.59239  -2.884
## JobRoleSales Executive:TotalWorkingYears         -17.68582   21.34505  -0.829
## JobRoleSales Representative:TotalWorkingYears    -78.64200   36.47677  -2.156
##                                                 Pr(>|t|)    
## (Intercept)                                     2.73e-11 ***
## I(log(JobLevel)^2)                               < 2e-16 ***
## I(TotalWorkingYears^2)                          1.63e-07 ***
## I(Education^2)                                   0.68867    
## I(YearsInCurrentRole^3)                          0.27203    
## log(JobLevel)                                    0.14469    
## JobRoleHuman Resources                           0.87158    
## JobRoleLaboratory Technician                     0.67657    
## JobRoleManager                                  2.43e-05 ***
## JobRoleManufacturing Director                    0.93605    
## JobRoleResearch Director                        3.13e-11 ***
## JobRoleResearch Scientist                        0.94172    
## JobRoleSales Executive                           0.44673    
## JobRoleSales Representative                      0.70020    
## TotalWorkingYears                               6.56e-10 ***
## YearsAtCompany                                   0.35847    
## JobRoleHuman Resources:TotalWorkingYears         0.10233    
## JobRoleLaboratory Technician:TotalWorkingYears  1.85e-06 ***
## JobRoleManager:TotalWorkingYears                 0.48634    
## JobRoleManufacturing Director:TotalWorkingYears  0.66366    
## JobRoleResearch Director:TotalWorkingYears       0.99289    
## JobRoleResearch Scientist:TotalWorkingYears      0.00403 ** 
## JobRoleSales Executive:TotalWorkingYears         0.40758    
## JobRoleSales Representative:TotalWorkingYears    0.03137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 973521.6)
## 
##     Null deviance: 1.837e+10  on 869  degrees of freedom
## Residual deviance: 8.236e+08  on 846  degrees of freedom
## AIC: 14491
## 
## Number of Fisher Scoring iterations: 2
plot(fit2)

RSS <- c(crossprod(fit2$residuals))
MSE <- RSS / length(fit2$residuals)
RMSE <- sqrt(MSE)
RMSE
## [1] 972.9676
e=ee_nums_w_cat_options
e$JobLevel=factor(e$JobLevel)
fit_alt=glm(MonthlyIncome~I(TotalWorkingYears^2)+I(Education^2)+I(YearsInCurrentRole^3)+JobRole*TotalWorkingYears+JobLevel*JobRole+.-YearsInCurrentRole-Education-Attrition-BusinessTravel-Department-EducationField-Gender-MaritalStatus-OverTime-fctrPerformanceRating-DistanceFromHome_cat
          ,data = e)
summary(fit_alt)
## 
## Call:
## glm(formula = MonthlyIncome ~ I(TotalWorkingYears^2) + I(Education^2) + 
##     I(YearsInCurrentRole^3) + JobRole * TotalWorkingYears + JobLevel * 
##     JobRole + . - YearsInCurrentRole - Education - Attrition - 
##     BusinessTravel - Department - EducationField - Gender - MaritalStatus - 
##     OverTime - fctrPerformanceRating - DistanceFromHome_cat, 
##     data = e)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2912.6   -585.0   -108.0    535.9   4477.7  
## 
## Coefficients: (19 not defined because of singularities)
##                                                   Estimate Std. Error t value
## (Intercept)                                     -1.849e+02  1.190e+03  -0.155
## I(TotalWorkingYears^2)                          -3.858e+00  8.002e-01  -4.821
## I(Education^2)                                  -3.571e+00  6.007e+00  -0.595
## I(YearsInCurrentRole^3)                          6.078e-02  7.082e-02   0.858
## JobRoleHuman Resources                           2.733e+03  1.258e+03   2.173
## JobRoleLaboratory Technician                     2.547e+03  1.185e+03   2.150
## JobRoleManager                                   3.719e+03  1.220e+03   3.048
## JobRoleManufacturing Director                    1.240e+03  9.831e+02   1.261
## JobRoleResearch Director                         4.544e+03  9.756e+02   4.657
## JobRoleResearch Scientist                        2.618e+03  1.213e+03   2.158
## JobRoleSales Executive                           1.633e+03  8.943e+02   1.826
## JobRoleSales Representative                      2.394e+03  1.193e+03   2.006
## TotalWorkingYears                                2.163e+02  3.587e+01   6.029
## JobLevel2                                        4.282e+03  1.106e+03   3.872
## JobLevel3                                        7.262e+03  1.071e+03   6.783
## JobLevel4                                        9.932e+03  1.196e+03   8.307
## JobLevel5                                        1.252e+04  1.245e+03  10.051
## YearsAtCompany                                  -4.358e+00  8.763e+00  -0.497
## JobRoleHuman Resources:TotalWorkingYears        -1.561e+02  7.489e+01  -2.085
## JobRoleLaboratory Technician:TotalWorkingYears  -1.096e+02  3.269e+01  -3.352
## JobRoleManager:TotalWorkingYears                 1.133e+01  4.012e+01   0.282
## JobRoleManufacturing Director:TotalWorkingYears -1.415e+01  3.087e+01  -0.458
## JobRoleResearch Director:TotalWorkingYears      -1.974e+01  3.332e+01  -0.592
## JobRoleResearch Scientist:TotalWorkingYears     -1.172e+02  3.143e+01  -3.729
## JobRoleSales Executive:TotalWorkingYears        -4.557e+01  2.732e+01  -1.668
## JobRoleSales Representative:TotalWorkingYears   -9.700e+01  4.041e+01  -2.400
## JobRoleHuman Resources:JobLevel2                -2.000e+03  1.360e+03  -1.471
## JobRoleLaboratory Technician:JobLevel2          -2.825e+03  1.129e+03  -2.501
## JobRoleManager:JobLevel2                                NA         NA      NA
## JobRoleManufacturing Director:JobLevel2         -1.025e+03  7.664e+02  -1.338
## JobRoleResearch Director:JobLevel2                      NA         NA      NA
## JobRoleResearch Scientist:JobLevel2             -2.069e+03  1.084e+03  -1.908
## JobRoleSales Executive:JobLevel2                -1.193e+03  7.069e+02  -1.688
## JobRoleSales Representative:JobLevel2           -2.367e+03  1.281e+03  -1.848
## JobRoleHuman Resources:JobLevel3                -1.556e+03  1.326e+03  -1.174
## JobRoleLaboratory Technician:JobLevel3          -3.761e+03  1.268e+03  -2.966
## JobRoleManager:JobLevel3                        -6.751e+02  9.418e+02  -0.717
## JobRoleManufacturing Director:JobLevel3         -8.401e+02  7.145e+02  -1.176
## JobRoleResearch Director:JobLevel3              -8.888e+02  6.571e+02  -1.353
## JobRoleResearch Scientist:JobLevel3                     NA         NA      NA
## JobRoleSales Executive:JobLevel3                -9.501e+02  6.700e+02  -1.418
## JobRoleSales Representative:JobLevel3                   NA         NA      NA
## JobRoleHuman Resources:JobLevel4                        NA         NA      NA
## JobRoleLaboratory Technician:JobLevel4                  NA         NA      NA
## JobRoleManager:JobLevel4                        -2.716e+02  4.711e+02  -0.577
## JobRoleManufacturing Director:JobLevel4                 NA         NA      NA
## JobRoleResearch Director:JobLevel4                      NA         NA      NA
## JobRoleResearch Scientist:JobLevel4                     NA         NA      NA
## JobRoleSales Executive:JobLevel4                        NA         NA      NA
## JobRoleSales Representative:JobLevel4                   NA         NA      NA
## JobRoleHuman Resources:JobLevel5                        NA         NA      NA
## JobRoleLaboratory Technician:JobLevel5                  NA         NA      NA
## JobRoleManager:JobLevel5                                NA         NA      NA
## JobRoleManufacturing Director:JobLevel5                 NA         NA      NA
## JobRoleResearch Director:JobLevel5                      NA         NA      NA
## JobRoleResearch Scientist:JobLevel5                     NA         NA      NA
## JobRoleSales Executive:JobLevel5                        NA         NA      NA
## JobRoleSales Representative:JobLevel5                   NA         NA      NA
##                                                 Pr(>|t|)    
## (Intercept)                                     0.876585    
## I(TotalWorkingYears^2)                          1.70e-06 ***
## I(Education^2)                                  0.552324    
## I(YearsInCurrentRole^3)                         0.391057    
## JobRoleHuman Resources                          0.030085 *  
## JobRoleLaboratory Technician                    0.031861 *  
## JobRoleManager                                  0.002375 ** 
## JobRoleManufacturing Director                   0.207659    
## JobRoleResearch Director                        3.73e-06 ***
## JobRoleResearch Scientist                       0.031207 *  
## JobRoleSales Executive                          0.068199 .  
## JobRoleSales Representative                     0.045145 *  
## TotalWorkingYears                               2.47e-09 ***
## JobLevel2                                       0.000116 ***
## JobLevel3                                       2.23e-11 ***
## JobLevel4                                       3.98e-16 ***
## JobLevel5                                        < 2e-16 ***
## YearsAtCompany                                  0.619134    
## JobRoleHuman Resources:TotalWorkingYears        0.037417 *  
## JobRoleLaboratory Technician:TotalWorkingYears  0.000838 ***
## JobRoleManager:TotalWorkingYears                0.777675    
## JobRoleManufacturing Director:TotalWorkingYears 0.646881    
## JobRoleResearch Director:TotalWorkingYears      0.553790    
## JobRoleResearch Scientist:TotalWorkingYears     0.000205 ***
## JobRoleSales Executive:TotalWorkingYears        0.095745 .  
## JobRoleSales Representative:TotalWorkingYears   0.016607 *  
## JobRoleHuman Resources:JobLevel2                0.141765    
## JobRoleLaboratory Technician:JobLevel2          0.012558 *  
## JobRoleManager:JobLevel2                              NA    
## JobRoleManufacturing Director:JobLevel2         0.181384    
## JobRoleResearch Director:JobLevel2                    NA    
## JobRoleResearch Scientist:JobLevel2             0.056747 .  
## JobRoleSales Executive:JobLevel2                0.091730 .  
## JobRoleSales Representative:JobLevel2           0.065008 .  
## JobRoleHuman Resources:JobLevel3                0.240839    
## JobRoleLaboratory Technician:JobLevel3          0.003107 ** 
## JobRoleManager:JobLevel3                        0.473682    
## JobRoleManufacturing Director:JobLevel3         0.240048    
## JobRoleResearch Director:JobLevel3              0.176552    
## JobRoleResearch Scientist:JobLevel3                   NA    
## JobRoleSales Executive:JobLevel3                0.156543    
## JobRoleSales Representative:JobLevel3                 NA    
## JobRoleHuman Resources:JobLevel4                      NA    
## JobRoleLaboratory Technician:JobLevel4                NA    
## JobRoleManager:JobLevel4                        0.564372    
## JobRoleManufacturing Director:JobLevel4               NA    
## JobRoleResearch Director:JobLevel4                    NA    
## JobRoleResearch Scientist:JobLevel4                   NA    
## JobRoleSales Executive:JobLevel4                      NA    
## JobRoleSales Representative:JobLevel4                 NA    
## JobRoleHuman Resources:JobLevel5                      NA    
## JobRoleLaboratory Technician:JobLevel5                NA    
## JobRoleManager:JobLevel5                              NA    
## JobRoleManufacturing Director:JobLevel5               NA    
## JobRoleResearch Director:JobLevel5                    NA    
## JobRoleResearch Scientist:JobLevel5                   NA    
## JobRoleSales Executive:JobLevel5                      NA    
## JobRoleSales Representative:JobLevel5                 NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 970373.4)
## 
##     Null deviance: 1.8370e+10  on 869  degrees of freedom
## Residual deviance: 8.0638e+08  on 831  degrees of freedom
## AIC: 14502
## 
## Number of Fisher Scoring iterations: 2
plot(fit_alt)
## Warning: not plotting observations with leverage one:
##   523

#visualizing JobRole by the numerical variables remaining
ee_final=data.frame(MonthlyIncome=ee$MonthlyIncome,JobLevel=ee$JobLevel,TotalWorkingYears=ee$TotalWorkingYears,YearsInCurrentRole=ee$YearsInCurrentRole,Education=ee$Education,YearsAtCompany=ee$YearsAtCompany,JobRole=ee$JobRole)

str(ee_final)
## 'data.frame':    870 obs. of  7 variables:
##  $ MonthlyIncome     : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ JobLevel          : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ TotalWorkingYears : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ YearsInCurrentRole: int  2 7 2 10 3 7 2 0 1 2 ...
##  $ Education         : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ YearsAtCompany    : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ JobRole           : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
eegath3=ee_final %>%
  as_data_frame() %>%
  gather(key = "variable", value = "value",
         -MonthlyIncome,-JobRole)

ggplot(eegath3, aes(x = value, y = MonthlyIncome, col=JobRole)) +
  geom_point(aes()) +
  geom_smooth(method='lm')+
  facet_wrap(~variable,scales='free')
## `geom_smooth()` using formula 'y ~ x'

#This is the final model with the lowest AIC
fit3=glm(MonthlyIncome~I(log(JobLevel)^2)+log(JobLevel)+I(TotalWorkingYears^2)+TotalWorkingYears+JobRole*TotalWorkingYears+.-YearsAtCompany-YearsInCurrentRole-Education-JobLevel,data = ee_final)
summary(fit3)
## 
## Call:
## glm(formula = MonthlyIncome ~ I(log(JobLevel)^2) + log(JobLevel) + 
##     I(TotalWorkingYears^2) + TotalWorkingYears + JobRole * TotalWorkingYears + 
##     . - YearsAtCompany - YearsInCurrentRole - Education - JobLevel, 
##     data = ee_final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2919.9   -602.5    -89.9    556.0   4502.9  
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                     2358.7147   345.9016   6.819
## I(log(JobLevel)^2)                              4623.3690   241.9862  19.106
## log(JobLevel)                                   -512.7206   343.3719  -1.493
## I(TotalWorkingYears^2)                            -3.7071     0.7044  -5.263
## TotalWorkingYears                                192.9223    30.9803   6.227
## JobRoleHuman Resources                            56.6393   543.0945   0.104
## JobRoleLaboratory Technician                     116.7663   358.8957   0.325
## JobRoleManager                                  2773.8790   656.4915   4.225
## JobRoleManufacturing Director                    -16.5977   336.3977  -0.049
## JobRoleResearch Director                        3354.4335   501.0200   6.695
## JobRoleResearch Scientist                        -64.5418   361.9981  -0.178
## JobRoleSales Executive                           187.8560   299.7262   0.627
## JobRoleSales Representative                     -171.9492   383.0521  -0.449
## TotalWorkingYears:JobRoleHuman Resources        -105.5045    65.6387  -1.607
## TotalWorkingYears:JobRoleLaboratory Technician  -123.2451    26.0996  -4.722
## TotalWorkingYears:JobRoleManager                  22.3097    28.2718   0.789
## TotalWorkingYears:JobRoleManufacturing Director   14.5206    22.8600   0.635
## TotalWorkingYears:JobRoleResearch Director         3.3896    25.1079   0.135
## TotalWorkingYears:JobRoleResearch Scientist      -74.1409    26.4507  -2.803
## TotalWorkingYears:JobRoleSales Executive         -14.2372    21.1310  -0.674
## TotalWorkingYears:JobRoleSales Representative    -77.1937    36.3547  -2.123
##                                                 Pr(>|t|)    
## (Intercept)                                     1.74e-11 ***
## I(log(JobLevel)^2)                               < 2e-16 ***
## log(JobLevel)                                    0.13576    
## I(TotalWorkingYears^2)                          1.80e-07 ***
## TotalWorkingYears                               7.46e-10 ***
## JobRoleHuman Resources                           0.91696    
## JobRoleLaboratory Technician                     0.74500    
## JobRoleManager                                  2.64e-05 ***
## JobRoleManufacturing Director                    0.96066    
## JobRoleResearch Director                        3.91e-11 ***
## JobRoleResearch Scientist                        0.85854    
## JobRoleSales Executive                           0.53099    
## JobRoleSales Representative                      0.65362    
## TotalWorkingYears:JobRoleHuman Resources         0.10835    
## TotalWorkingYears:JobRoleLaboratory Technician  2.73e-06 ***
## TotalWorkingYears:JobRoleManager                 0.43027    
## TotalWorkingYears:JobRoleManufacturing Director  0.52547    
## TotalWorkingYears:JobRoleResearch Director       0.89264    
## TotalWorkingYears:JobRoleResearch Scientist      0.00518 ** 
## TotalWorkingYears:JobRoleSales Executive         0.50065    
## TotalWorkingYears:JobRoleSales Representative    0.03401 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 971831)
## 
##     Null deviance: 1.8370e+10  on 869  degrees of freedom
## Residual deviance: 8.2508e+08  on 849  degrees of freedom
## AIC: 14486
## 
## Number of Fisher Scoring iterations: 2
par(mfrow=c(3,2))
plot(fit3)
RSS <- c(crossprod(fit3$residuals))
MSE <- RSS / length(fit3$residuals)
RMSE <- sqrt(MSE)
RMSE
## [1] 973.8445
#####Predicting the test values
# income_test = read.csv("Raw_Data_Files/CaseStudy2CompSet No Salary.csv",header = TRUE,stringsAsFactors = TRUE)
# ref=ee_final %>% select(-MonthlyIncome)
# ref=cbind(ID=ee$ID,ref)
# income_test=income_test %>% select(all_of(variable.names(ref)))
# str(income_test)
# preds=predict(fit3,income_test)
# preds=unlist(preds, use.names=FALSE)
# income_preds=data.frame(ID=income_test$ID,MonthlyIncome=preds)
# income_preds
# summary(income_preds)
# summary(ee$MonthlyIncome)
#write.csv(income_preds,"Case2PredictionsStewartSalary.csv")

Using multiple linear regression, we were able to predict Monthly Income. To measure the significance of numerical variables to Monthly Income, a t-test was used. To measure the significance of categorical variables to Monthly Income, a Chi-Squared Test was used. Iterating through combinations of possible models, the final model (fit3) was found to have the lowest AIC. The RMSE = 973.8445.

Conclusion

Overall, we evaluated Job Role against Job Satisfaction and Monthly Income. We identified key variables and predicted Attrition using Naive Bayes. We identified key variables and predicted Monthly Income using Multiple Linear Regression.

Next Steps: We need to understand if the submitted test predictions have a fit similar to our model. If so, we can proceed with implementation. We should also evaluate future opportunities to leverage data science within Frito Lay.